The aim of this project is to predict whether a NRFI will occur in any given MLB game. We will be using data that spans three MLB datasets all found on data.world. Let’s get into it!
NRFI, short for “No Run First Inning,” has quickly gained popularity as a thrilling betting option among baseball enthusiasts. The concept behind this bet is pretty straightforward. With a baseball game divided into nine innings, each team gets a chance to score runs while batting. An inning concludes once both teams have made three outs. Now, the NRFI bet centers solely on the first inning, challenging bettors to predict whether either team will score a run during that initial frame.
While the NRFI may not be an officially recognized statistic by the MLB, we’re interesting in seeing if certain pitching and batting statistics can serve as good predictors for if a NRFI will occur or not. Let’s dive deeper into the NRFI, which is widely considered the greatest bet on the diamond.
Great, now that we know what a NRFI is, how do we predict one? We will be using data that spans three MLB datasets to create a final dataset for our models. This process will involve data cleaning and manipulation to ensure our final dataset includes only relevant observations and predictors useful in predicting NRFIs. Next, we will perform an Exploratory Data Analysis to gain insights into which predictors could be valuable. Following that, we will split the data into training and testing sets, build a recipe, and establish folds for the 5-fold cross-validation implementation. In this project, we will develop four different models on the training data: Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Random Forest. Finally, we will assess the performance of each model and determine which one works best for predicting NRFIs. Let’s play ball!
In order to create the best model, we need to clean up the original raw data. Because the final dataset will be a combination of 3 raw datasets, there will also be a lot of tidying to make the data usable.
The first dataset we will be looking at will be MLB game logs from 1871 - 2016. Lets take a look.
game_logs <- read.csv("data/game_logs.csv")
colnames(game_logs)
## [1] "date" "number_of_game"
## [3] "day_of_week" "v_name"
## [5] "v_league" "v_game_number"
## [7] "h_name" "h_league"
## [9] "h_game_number" "v_score"
## [11] "h_score" "length_outs"
## [13] "day_night" "completion"
## [15] "forefeit" "protest"
## [17] "park_id" "attendance"
## [19] "length_minutes" "v_line_score"
## [21] "h_line_score" "v_at_bats"
## [23] "v_hits" "v_doubles"
## [25] "v_triples" "v_homeruns"
## [27] "v_rbi" "v_sacrifice_hits"
## [29] "v_sacrifice_flies" "v_hit_by_pitch"
## [31] "v_walks" "v_intentional.walks"
## [33] "v_strikeouts" "v_stolen_bases"
## [35] "v_caught_stealing" "v_grounded_into_double"
## [37] "v_first_catcher_interference" "v_left_on_base"
## [39] "v_pitchers_used" "v_individual_earned_runs"
## [41] "v_team_earned_runs" "v_wild_pitches"
## [43] "v_balks" "v_putouts"
## [45] "v_assists" "v_errors"
## [47] "v_passed_balls" "v_double_plays"
## [49] "v_triple_plays" "h_at_bats"
## [51] "h_hits" "h_doubles"
## [53] "h_triples" "h_homeruns"
## [55] "h_rbi" "h_sacrifice_hits"
## [57] "h_sacrifice_flies" "h_hit_by_pitch"
## [59] "h_walks" "h_intentional.walks"
## [61] "h_strikeouts" "h_stolen_bases"
## [63] "h_caught_stealing" "h_grounded_into_double"
## [65] "h_first_catcher_interference" "h_left_on_base"
## [67] "h_pitchers_used" "h_individual_earned_runs"
## [69] "h_team_earned_runs" "h_wild_pitches"
## [71] "h_balks" "h_putouts"
## [73] "h_assists" "h_errors"
## [75] "h_passed_balls" "h_double_plays"
## [77] "h_triple_plays" "hp_umpire_id"
## [79] "hp_umpire_name" "X1b_umpire_id"
## [81] "X1b_umpire_name" "X2b_umpire_id"
## [83] "X2b_umpire_name" "X3b_umpire_id"
## [85] "X3b_umpire_name" "lf_umpire_id"
## [87] "lf_umpire_name" "rf_umpire_id"
## [89] "rf_umpire_name" "v_manager_id"
## [91] "v_manager_name" "h_manager_id"
## [93] "h_manager_name" "winning_pitcher_id"
## [95] "winning_pitcher_name" "losing_pitcher_id"
## [97] "losing_pitcher_name" "saving_pitcher_id"
## [99] "saving_pitcher_name" "winning_rbi_batter_id"
## [101] "winning_rbi_batter_id_name" "v_starting_pitcher_id"
## [103] "v_starting_pitcher_name" "h_starting_pitcher_id"
## [105] "h_starting_pitcher_name" "v_player_1_id"
## [107] "v_player_1_name" "v_player_1_def_pos"
## [109] "v_player_2_id" "v_player_2_name"
## [111] "v_player_2_def_pos" "v_player_3_id"
## [113] "v_player_3_name" "v_player_3_def_pos"
## [115] "v_player_4_id" "v_player_4_name"
## [117] "v_player_4_def_pos" "v_player_5_id"
## [119] "v_player_5_name" "v_player_5_def_pos"
## [121] "v_player_6_id" "v_player_6_name"
## [123] "v_player_6_def_pos" "v_player_7_id"
## [125] "v_player_7_name" "v_player_7_def_pos"
## [127] "v_player_8_id" "v_player_8_name"
## [129] "v_player_8_def_pos" "v_player_9_id"
## [131] "v_player_9_name" "v_player_9_def_pos"
## [133] "h_player_1_id" "h_player_1_name"
## [135] "h_player_1_def_pos" "h_player_2_id"
## [137] "h_player_2_name" "h_player_2_def_pos"
## [139] "h_player_3_id" "h_player_3_name"
## [141] "h_player_3_def_pos" "h_player_4_id"
## [143] "h_player_4_name" "h_player_4_def_pos"
## [145] "h_player_5_id" "h_player_5_name"
## [147] "h_player_5_def_pos" "h_player_6_id"
## [149] "h_player_6_name" "h_player_6_def_pos"
## [151] "h_player_7_id" "h_player_7_name"
## [153] "h_player_7_def_pos" "h_player_8_id"
## [155] "h_player_8_name" "h_player_8_def_pos"
## [157] "h_player_9_id" "h_player_9_name"
## [159] "h_player_9_def_pos" "additional_info"
## [161] "acquisition_info"
dim(game_logs)
## [1] 171907 161
This dataset provides game logs from 171,907 different games along with 161 possible predictors. Wow! Obviously we won’t need all of this data so let’s try to trim this set down. The two most important predictors in this dataset are the v_line_score and h_line_score (visiting and home line scores). Here are the v line scores from the first 6 games in the dataset.
head(game_logs$v_line_score)
## [1] "000000000" "107000435" "610020003" "101403111" "000002232" "12120534"
What do these numbers mean? This value is the amount of runs the visiting team scored each inning. For example, “107000435” means the visiting team scored 1 in the first inning, 0 in the second inning, 7 in the third inning, etc. If we take the first digit of the line score for the visiting and home team, we can create a predictor called “RFI” which is a factor variable that states whether or not a run was scored in the first inning of that game.
vline <- game_logs$v_line_score
v_first <- ifelse(substr(vline, 1, 1) != "(",as.numeric(substr(vline, 1, 1)), as.numeric(substr(vline,2,3)))
hline <- game_logs$h_line_score
h_first <- ifelse(substr(hline, 1, 1) != "(",as.numeric(substr(hline, 1, 1)), as.numeric(substr(hline,2,3)))
first <- v_first+h_first
rfi <- ifelse(first > 0, "yes", "no")
Here, we create a variable called first, which is the sum of the first inning score of both teams. The RFI variable simply states whether that value is greater than 0 (yes) or not (no).
head(data.frame(vline,hline,rfi),10)
## vline hline rfi
## 1 000000000 010010000 no
## 2 107000435 640113030 yes
## 3 610020003 010020100 yes
## 4 101403111 077000000 yes
## 5 000002232 101003000 yes
## 6 12120534 01410004 yes
## 7 141020004 004100012 yes
## 8 053210012 200002001 yes
## 9 030100101 003300123 no
## 10 302604(11)30 610020221 yes
In terms of useful numerical data, that is about as good as it gets for the game logs. This dataset does not provide any batting or pitching statistics. Luckily, we have two other datasets that will provide that for us. For now, lets add some useful categorical predictors that the game_logs provide. These include: pitcher, team, and season IDs.
vpitch <- game_logs$v_starting_pitcher_id
vpitcher <- substr(vpitch,1,5)
hpitch <- game_logs$h_starting_pitcher_id
hpitcher <- substr(hpitch,1,5)
vteam <- game_logs$v_name
hteam <- game_logs$h_name
date <- substr(game_logs$date,1,4)
test_logs <- data.frame(date,rfi,vpitcher,hpitcher,vteam,hteam)
With a dataset of 171,907 rows, there is a lot of missing data. If an observation is missing a home or visiting line score, we can’t determine the RFI and the data point is useless. Because our dataset is so big, we can afford to simply remove any observations that have missing line scores.
final_logs <- test_logs[!is.na(test_logs$rfi),]
And with that, lets take a look at a random data point in our game logs set!
final_logs[5739,]
## date rfi vpitcher hpitcher vteam hteam
## 30371 1909 yes camnh froma PIT CIN
dim(final_logs)
## [1] 147271 6
We now have access to the pitchers, teams, and RFI of 147,271 baseball games. Now what? We need actual numerical predictors that can help predict a NRFI. This seems like a good time to load in our 2nd dataset, the pitching dataset.
pitching_data <- read.csv("data/pitching.csv")
colnames(pitching_data)
## [1] "playerID" "yearID" "stint" "teamID" "lgID" "W"
## [7] "L" "G" "GS" "CG" "SHO" "SV"
## [13] "IPouts" "H" "ER" "HR" "BB" "SO"
## [19] "BAOpp" "ERA" "IBB" "WP" "HBP" "BK"
## [25] "BFP" "GF" "R" "SH" "SF" "GIDP"
This dataset gives the season stats of every pitcher in every season from 1871 to 2016.
Here is a list of major baseball stats that are used in evaluating
how good a pitcher is:
ERA: Earned Run Average. How many runs a batter scores
on this pitcher over 9 innings on average.
WHIP: Walks/Hits per Innings Pitched. The number of
batters the pitcher allowed to get on base per inning.
SOP: Strikeout percentage. The percentage of batters a
pitcher has struck out.
HRP: Home run percentage. The percentage of batters a
pitcher has given up a home run.
Let’s calculate these stats from the data and create a final pitching dataset.
head(final_pitch)
## pitch year pteam era whip sop hrp
## 1 bechg 1871 PH1 7.96 0.6923077 0.006849315 0.000000000
## 2 braia 1871 WS3 4.50 0.5025253 0.010069713 0.003098373
## 3 fergb 1871 NY2 27.00 2.6666667 0.000000000 0.000000000
## 4 fishc 1871 RC1 4.35 0.5101721 0.013888889 0.002777778
## 5 fleef 1871 NY2 10.00 0.8518519 0.000000000 0.000000000
## 6 flowd 1871 TRO 0.00 0.3333333 0.000000000 0.000000000
dim(final_pitch)
## [1] 42007 7
This data is only 42,007 columns. This makes sense because these are season averages not single game statistics. How are we going to add this to our game logs dataset with nearly 150,000 observations? It is a bit tedious. We need go through every game in the game logs and check for the team, year, and pitching id. Then we can look for a match in the pitching dataset and add the stats for each game. If there are no matches or multiple matches (there are duplicate pitching ids), we can remove that game from our game logs. The game logs will hold two of each pitching stat, one for each team’s pitcher.
Let’s first make sure to remove any games with multiple or no matches to the pitching dataset.
nrow(final_logs)
## [1] 147271
final_data <- final_logs
valid <- c()
for(i in 1:nrow(final_data)) {
vpi <- final_data[i,3]
vteamv <- final_data[i,5]
yearv <- final_data[i,1]
x <- nrow(filter(final_pitch, pitch == vpi, pteam==vteamv,year==yearv)) == 1
valid <- append(valid,x)
}
validdf <- data.frame(valid)
final_data["valid"] <- validdf
final_data_2 <- final_data[final_data$valid,]
final_data_2
The code chunk above is going through every game and only marking the games with exactly one match as valid.Then we remove any invalid observations from the dataset. Our dataset is now ready for the pitching stats. Before we add those, let’s look take a look at the batting stats database.
teams_data <- read.csv("data/teams.csv")
colnames(teams_data)
## [1] "yearID" "lgID" "teamID" "franchID"
## [5] "divID" "Rank" "G" "Ghome"
## [9] "W" "L" "DivWin" "WCWin"
## [13] "LgWin" "WSWin" "R" "AB"
## [17] "H" "X2B" "X3B" "HR"
## [21] "BB" "SO" "SB" "CS"
## [25] "HBP" "SF" "RA" "ER"
## [29] "ERA" "CG" "SHO" "SV"
## [33] "IPouts" "HA" "HRA" "BBA"
## [37] "SOA" "E" "DP" "FP"
## [41] "name" "park" "attendance" "BPF"
## [45] "PPF" "teamIDBR" "teamIDlahman45" "teamIDretro"
We are specifically interested in two batting stats:
BAVG: Batting average. A batting teams total hits
divided by their at-bats.
SLP: Slugging Percentage. Measures a batting teams
efficiency. The formula takes into account exactly how many bases were
taken per hit.
Notice that these stats are team averages, not single player. While
there is only one pitcher for a team in the first inning, there can be
any number of batters and in various orders. For batting statistics, it
makes sense to use team averages. Let’s take a look at our final batting
dataset.
tyear <- teams_data$yearID
team <- teams_data$teamID
thits <- teams_data$H
tabs <- teams_data$AB
twalks <- teams_data$BB
hbp <- teams_data$HBP
sf <- teams_data$SF
sb <- teams_data$X2B
tb <- teams_data$X3B
thr <- teams_data$HR
fb <- thits - (sb+tb+thr)
bavg <- thits/tabs
slp <- (fb+(2*sb)+(3*tb)+(4*thr))/(tabs)
test_teams <- data.frame(tyear,team,bavg,slp)
final_teams <- test_teams[test_teams$tyear < 1970 | test_teams$tyear > 1979,]
head(final_teams)
## tyear team bavg slp
## 1 1871 BS1 0.3104956 0.4220117
## 2 1871 CH1 0.2700669 0.3737458
## 3 1871 CL1 0.2765599 0.3912310
## 4 1871 FW1 0.2386059 0.2935657
## 5 1871 NY2 0.2870370 0.3497151
## 6 1871 PH1 0.3200625 0.4348165
dim(final_teams)
## [1] 2619 4
Similar to pitching, we will go through every game and look for a matching team and year to find the stats for that game. We do not have to check for valid entries anymore because there is no missing or duplicate data in this set. After adding all the stats, pitching and batting, we have a complete dataset.
head(data1)
## X date rfi vpitcher hpitcher vteam hteam vera hera vwhip hwhip
## 1 1 1871 no prata mathb CL1 FW1 3.77 5.17 0.5089021 0.5562130
## 2 2 1871 yes spala braia BS1 WS3 3.36 4.50 0.4805699 0.5025253
## 3 3 1871 yes prata fishc CL1 RC1 3.77 4.35 0.5089021 0.5101721
## 4 4 1871 yes prata zettg CL1 CH1 3.77 2.73 0.5089021 0.4473684
## 5 5 1871 yes spala mcmuj BS1 TRO 3.36 5.53 0.4805699 0.6760375
## 6 6 1871 yes zettg prata CH1 CL1 2.73 3.77 0.4473684 0.5089021
## vsop hsop vhrp hhrp vbavg hbavg vslp
## 1 0.03043868 0.019406393 0.008057296 0.005707763 0.2765599 0.2386059 0.3912310
## 2 0.01798280 0.010069713 0.001563722 0.003098373 0.3104956 0.2771619 0.4220117
## 3 0.03043868 0.013888889 0.008057296 0.002777778 0.2765599 0.2644788 0.3912310
## 4 0.03043868 0.019247594 0.008057296 0.005249344 0.2765599 0.2700669 0.3912310
## 5 0.01798280 0.008995502 0.001563722 0.002998501 0.3104956 0.3076923 0.4220117
## 6 0.01924759 0.030438675 0.005249344 0.008057296 0.2700669 0.2765599 0.3737458
## hslp
## 1 0.2935657
## 2 0.3688101
## 3 0.3638996
## 4 0.3737458
## 5 0.4174679
## 6 0.3912310
dim(data1)
## [1] 111521 19
This is good so far but we need to wrap up a couple more things. Right now we have two separate predictors for visiting and home teams. While both are necessary because both teams plays in the first inning, we do not need to have them as separate predictors. Instead, we can use the average of the home and visiting team for each stat.
head(data2)
## X rfi era whip sop hrp bavg slp
## 1 1 no 4.470 0.5325575 0.02492253 0.006882529 0.2575829 0.3423984
## 2 2 yes 3.930 0.4915476 0.01402626 0.002331048 0.2938287 0.3954109
## 3 3 yes 4.060 0.5095371 0.02216378 0.005417537 0.2705193 0.3775653
## 4 4 yes 3.250 0.4781352 0.02484313 0.006653320 0.2733134 0.3824884
## 5 5 yes 4.445 0.5783037 0.01348915 0.002281111 0.3090940 0.4197398
## 6 6 yes 3.250 0.4781352 0.02484313 0.006653320 0.2733134 0.3824884
dim(data2)
## [1] 111521 8
Ok, we are almost done. While we did account for invalid data within the game logs, we never checked for missing data within the pitching and batting data. Sure enough, we did have missing data within the pitching dataset. Luckily, this only amounted to about 2,000 observations so we were able to remove it from the set. So without further ado, here is the final set.
head(data)
## X rfi era whip sop hrp bavg slp
## 1 1 no 4.470 0.5325575 0.02492253 0.006882529 0.2575829 0.3423984
## 2 2 yes 3.930 0.4915476 0.01402626 0.002331048 0.2938287 0.3954109
## 3 3 yes 4.060 0.5095371 0.02216378 0.005417537 0.2705193 0.3775653
## 4 4 yes 3.250 0.4781352 0.02484313 0.006653320 0.2733134 0.3824884
## 5 5 yes 4.445 0.5783037 0.01348915 0.002281111 0.3090940 0.4197398
## 6 6 yes 3.250 0.4781352 0.02484313 0.006653320 0.2733134 0.3824884
dim(data)
## [1] 109656 8
We have ourselves a dataset! Let’s explore it.
We can create some visualizations to better understand our data and get an idea of how well our predictors will perform when it comes time for modeling. Let’s get into it.
Let’s create a correlation heatmap to spot out relationships between our predictors.
It seems for the most part, there is little to no relationship between
our predictors. A huge exception to that fact is the correlation between
ERA and WHIP. This should not be that surprising given the formula of
both stats. ERA is calculated by
((earned runs given up) * 9 innings)/Innings pitched.
WHIP is calculated by
(Walks + Hits)/Innings Pitched
If we take out the constant “9” as well as Innings pitched which appears
in both equations, we are left comparing the number of runs given up
against the amount of times the pitcher allowed someone on base. This
makes sense that they are correlated because if a pitcher is allowing a
lot of runs, it can be assumed they are allowing a lot of base runners
and vice versa.
Let’s take a look at the distribution of the RFI’s in our data.
We can see that the distribution of RFI is pretty even with the YRFI barely beating out the NRFI. This means our data is not imbalanced and will not be a cause of any inaccuracies we might encounter later on.
As discussed before, Earned Run Average (ERA) is a statistic that measures the amount of earned runs a pitcher is giving up on average over the course of 9 innings. While we would hope to only use first inning stats (pitching stats change drastically from the start of a game compared to the end), ERA can still be useful in evaluating how well the pitcher will play in the first inning. Let’s make a plot to visualize how well ERA could be for our models.
This plot is not so encouraging. It is showing that for every era interval which is 1 era unit long, the distribution of the RFI is just about half. This means that whether our ERA is big or small, it is not really affecting RFI. Hopefully we can get better results with our other predictors.
Walks/Hits per Innings Pitched (WHIP) is a way of determining how many base runners a pitcher is allowing. In baseball, the only way to get on base is if you hit the ball or you get a walk (free base due to pitching error). By using these particular stats, we can see how often a batting team is getting on base and having a higher chance of scoring runs.
Unfortunately, we have a similar predicament where in each interval of
whip, we see an even distribution between YRFIs and NRFIs. In this case,
it isn’t too surprising. We had discovered earlier that the ERA and WHIP
predictors were heavily correlated. It makes sense that both produce
similar distributions of the RFI.
Strikeout %. The percentage of batters a pitcher faced in which he struck him out. In order for a NRFI to occur, we need 3 outs per pitcher. If we can get a pitcher that is able to get a high percentage of batters out without allowing someone to get on base, it significantly raises the chances of a NRFI. That is the significance of the strikeout percentage.
It seems as though this predictor is more of the same. We can see in the
intervals of 0.05 SOP units, we have an even distribution of YRFIs and
NRFIs. It’s time to move on to our final pitching predictor, the home
run percentage.
Home Run %. The formula that we used is more simple than the actual HR% equation but should not differ much when it comes time to model. HR% is the percentage of batters a pitcher allows a home run. Home runs are the number 1 cause for a NRFI to fail because all it takes is one hit for the YRFI to cash. Therefore, we would think pitchers with high HRP would have a smaller amount of NRFIs.
Our final pitching stat also seems pretty split. So far in our EDA, we found that all our pitching stats may not be very good for predicting NRFIs. Let’s take a look a the two batting stats.
BAVG is the batting average of team. The calculation is simple: The team’s total hits divided by their at-bats(the amount of times they batted). The logic here is that if the team is getting hits on a high percentage of their opportunities, they are more likely to score runs and in turn, a YRFI.
No luck with our batting stats. In our intervals with 0.1 BAVG units, we see an even distribution of RFIs. Now it is time for the final stat, slugging percentage.
Slugging %. This stat is a tricky formula that calculates how well a team gets on base. It is basically the Total Bases divided by the teams at bats. To calculate the team’s total bases, you take all their singles plus their doubles times 2 plus their triples times 3 plus their homers times 4. If a team is slugging at a high percentage, they are getting on base more and raising the chances of a YRFI.
Another split distribution. Unfortunately, the EDA was not too encouraging. We found that all of our stats are not able to explain when a RFI is N or Y. Still, lets jump into modeling and see how accurate of a model we can create with the stats we have in our dataset.
We have seen how well the predictors match up to our response variable. It’s now time to create a model that can actually predict whether a NRFI will occur or not. This process will involve making a train/test split, creating a recipe, and form cross-validation within our models.
In order to avoid over-fitting, it is better to train our model on a big chunk of our data, and then test the accuracy with the rest. This will ensure that our model will actually work for data that is not currently present in our dataset. We will set a random seed so the same split will be replicated every time we run the code. We will also make sure to stratify on the RFI variable.
data <- mutate(data, rfi = factor(rfi)) #make sure that RFI is a factor
set.seed(3435)
mlb_split <- initial_split(data, strata = rfi, prop = 0.7)
mlb_train <- training(mlb_split)
mlb_test <- testing(mlb_split)
dim(data)
## [1] 109656 8
We see based on the dimensions, that the training and testing data is about 70-30 split. These are sufficient values and we can move on to recipe building.
Now it is time to create a recipe that brings together our predictor and response variables. The recipe is our chance to specify what predictors from our dataset we want to use. We could also turn any categorical predictors into dummy variables as well as handle any predictors with missing data. With that being said, we have cleaned our dataset to the point where we are using all 8 predictors and have no missing data. In addition, none of the predictors are categorial so we won’t have to worry about creating any dummy variables. Below we can see the completed recipe.
mlb_recipe <- recipe(rfi ~ era + whips + sops + hrps + bavgs + slps, mlb_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
prep(mlb_recipe) %>%
bake(new_data = mlb_train)
Let’s form a stratified cross validation with 5 folds stratifying on our response variable, RFI.
mlb_folds <- vfold_cv(mlb_train, v = 5,
strata = rfi)
With our setup all done, we can now move on to building the prediction models.
Now it is time to build our models. As stated earlier, we are going to try to fit 4 different models to predict whether or not a NRFI will occur. The 4 models are: Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Random Forest. We will use roc_auc as our performance metric for testing the models. For each model we will set up a workflow and add the model and recipe. For our random forest model, we will also need a tuning grid to fine tune the parameters used. We can then fit all the models to our training data and see how well the best model works against our testing data. Let’s create our models.
log_reg <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_wkflow <- workflow() %>%
add_model(log_reg) %>%
add_recipe(mlb_recipe)
lda_mod <- discrim_linear() %>%
set_mode("classification") %>%
set_engine("MASS")
lda_wkflow <- workflow() %>%
add_model(lda_mod) %>%
add_recipe(mlb_recipe)
qda_mod <- discrim_quad() %>%
set_mode("classification") %>%
set_engine("MASS")
qda_wkflow <- workflow() %>%
add_model(qda_mod) %>%
add_recipe(mlb_recipe)
rf_class_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_class_wf <- workflow() %>%
add_recipe(mlb_recipe) %>%
add_model(rf_class_spec)
rf_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
min_n(range = c(10, 20)),
levels = 5)
We now have all our models created and stored. It is time to finally see how well our models performed. We will start off by looking at the autoplot of our Random Forest model and then jump into the actual performance metrics of each model.
The first step in analyzing our results is using the autoplot function in R on our Random Forest model. This function provides visualizations on how our model performs as the parameters change. Let’s create a Random Forest autoplot.
autoplot(tune_class) + theme_minimal()
Our random forest plot needed to tune 3 parameters: mtry, trees, and
min_n. Mtry is the number of variables to randomly
sample as candidates at each split. Trees is the number
of trees in our forest. Min_n is the number of data
points required for the nodes to be split further down the tree. Let’s
take a look at the best values for the parameters we tuned.
show_best(tune_class, n = 1)
## Warning: No value of `metric` was given; metric 'roc_auc' will be used.
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 200 20 roc_auc binary 0.535 5 0.00110 Preprocessor1_Model1…
Now that we have tuned our parameters, we can create the best version of our Random Forest model.
rf_class_spec_b <- rand_forest(mtry = 1,
trees = 200,
min_n = 20) %>%
set_engine("ranger") %>%
set_mode("classification")
rf_class_wf_b <- workflow() %>%
add_recipe(mlb_recipe) %>%
add_model(rf_class_spec_b)
We have our models. Let’s see how accurate they are.
The metric we will be looking at is roc_auc. We will see how this metric compares across all four models we created.
## Model ROC_AUC
## 1 Logistic Regression 0.5597797
## 2 LDA 0.5589193
## 3 QDA 0.5488933
## 4 Random Forest 0.5329391
In machine learning, ROC AUC (Receiver Operating Characteristic Area Under the Curve) is a performance metric used to evaluate the performance of binary classification models. It measures the ability of a model to distinguish between two classes by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at various classification thresholds. An ROC AUC of 1 means the model can perfectly distinguish between positive and negative instances while a value of 0.5 means the model is randomly guessing. Given the values in the table above, our models did not perform too well in predicting NRFIs. With that being said, the best model we got was Logistic Regression and we will move forward testing how well that model does on the testing dataset.
Here are some more graphics to visualize our ROC_AUC values.
mlb_bar_plot
mlb_lollipop_plot
mlb_dot_plot
Now the final test, seeing how accurate our best model is against the mlb_test data we created before. We will do this by creating a confusion matrix and seeing what percentage of the logistical regression model’s predictions are correct.
log_fit <- fit(log_wkflow, mlb_train)
augment(log_fit, new_data = mlb_test) %>%
conf_mat(truth = rfi, estimate = .pred_class)
## Truth
## Prediction no yes
## no 6808 6048
## yes 9019 11023
Officially, our best model can predict RFIs with 0.542 accuracy! That is pretty bad. Unfortunately, I will probably not be able to use this model for sports betting over the summer.
Although we had a detailed dataset that was put through many
different tests and analyses, we were not able to find a great model to
predict NRFIs in the MLB.
While the results we produced were disappointing, they were not very
surprising. Baseball is a sport that is notoriously difficult to model,
and whether a run will be scored in the first inning can seem just plain
random at times. With that being said, I believe some changes could be
made to this model that would greatly improve its roc auc and accuracy.
All of our statistical predictors were player statistics that spanned
the whole 9 innings. As stated earlier, a baseball player’s performance
in the first inning differs considerably from their performance later in
the game. This is most likely caused by players initially settling into
the game as well as their bodies wearing out as the game goes on. As a
result of this fact, our model could also differ greatly if we were to
use only first-inning stats rather than full-game stats. If I were to do
this project again, it would also make sense to use more
machine-learning models just to make sure we find the best possible
model. However, we could tell right from the EDA that our predictors
were simply not explaining any of the variability in our RFI
statistics.
I thoroughly enjoyed working on this project and found it to be an
exceptional first-time experience with machine learning. The journey
from handling raw data to crafting a predictive model was both enjoyable
and rewarding. As someone who is fascinated by the world of sports
analytics, this project has made me excited for all the possibilities
that lie ahead in this field.